Skip to content

GH-493: Clarify Bounding Box Behavior in GeospatialStatistics #494

Merged
rdblue merged 14 commits intoapache:masterfrom
jiayuasu:bbox
May 12, 2025
Merged

GH-493: Clarify Bounding Box Behavior in GeospatialStatistics #494
rdblue merged 14 commits intoapache:masterfrom
jiayuasu:bbox

Conversation

@jiayuasu
Copy link
Copy Markdown
Member

@jiayuasu jiayuasu commented Apr 16, 2025

Rationale for this change

Update the specification to offer clearer instructions on how writers should populate bounding box values in the presence of NaNs, and how readers should interpret missing or incomplete bounding boxes.

What changes are included in this PR?

Added comments in the Bounding Box section of the Geospatial.md.

Below is an example illustrating how bounding box statistics behave based on the presence of X and Y values:

All values in this example file are allowed by WKB encoding.

--- Parquet file

file-level bbox: [1, 9, 100, 900]

-- Row group 1
row-group bbox 1: [1, 2, 100, 100]

POINT (1, 100)
POINT (2. NaN)


-- Row group 2
row-group bbox 2: [3, 3, 300, 300]

POINT (3, 300)
POINT (NaN, NaN)


-- Row group 3
row-group bbox 3: no bbox

POINT (5, NaN)
POINT (6, NaN)


-- Row group 4
row-group bbox 4: [7, 8, 700, 800]

POINT (7, 700)
POINT (8, 800)


-- Row group 5
row-group bbox 5: no bbox

POINT (NaN, NaN)
POINT (NaN, NaN)

-- Row group 6
row-group bbox 5: [9, 9, 900, 900]

LINESTRING EMPTY
POINT (9, 900)

Do these changes have PoC implementations?

Yes, apache/parquet-java#2971 and apache/arrow#45459

@jiayuasu
Copy link
Copy Markdown
Member Author

jiayuasu commented Apr 16, 2025

@wgtmac @paleolimbot @zhangfengcdt @Kontinuation

Please review. Thank you!

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for putting this together! The bounding box went through a number of iterations when the higher level changes were being discussed and the language you've added here is at least what I think the original intention was (although this wasn't captured in the comments). I think this is the cleanest way to represent the "zero non-NaN elements" case, but we do have to make it explicit to ensure that readers can trust the statistics to skip (e.g.) all null row groups.

Copy link
Copy Markdown
Member

@wgtmac wgtmac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification! I think we are talking about the suggested behavior of writer and reader. What about using a table to describe this?

Values \ Behavior Writer Reader
X and Y Do not produce bbox if any X or Y is NaN Omit the bbox if any of xmin/xmax/ymin/ymax is NaN
Z Do not produce zmin/zmax if any Z is NaN Omit the zmin/zmax if any of zmin/zmax is NaN
M Do not produce mmin/mmax if any M is NaN Omit the mmin/mmax if any of mmin/mmax is NaN

By omit we mean that we should make no assumption about the values from the bbox and thus cannot be used for predicate pushdown.

Comment thread Geospatial.md Outdated
For `GEOGRAPHY` types, X and Y values are restricted to the canonical ranges of
[-180, 180] for X and [-90, 90] for Y.

When `GeospatialStatistics` is present, writers must omit zmin and zmax if and
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about putting these between line 95 and 97?

Comment thread Geospatial.md Outdated
Comment thread Geospatial.md Outdated

When `GeospatialStatistics` is present, writers must omit zmin and zmax if and
only if there are zero non-NaN Z values in the column chunk, and must omit mmin
and mmax if and only if there are zero non-NaN M values. The bounding box must
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to mention X and Y values first because they are required.

Comment thread Geospatial.md Outdated
an indication that all corresponding values are null, and may use this
information to skip data during predicate evaluation. For example, a reader may
skip a row group if the bounding box is absent, indicating that all X and Y
coordinates are null.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a null value in this case? Is WKB able to encode a null coordinate? I was thinking the binary value is null in this case.

Copy link
Copy Markdown
Member

@wgtmac wgtmac Apr 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, a reader may
skip a row group if the bounding box is absent, indicating that all X and Y
coordinates are null.

This is counter-intuitive because usually in Parquet we cannot skip the row group if its min/max stats is missing in that we cannot make any assumption about its data. It might be the case that all values are null, or the writer does not implement this feature at all.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's the idea with this language...we need the absent-ness to be significant so that there is a path for a reader to skip an all empty/null row group, and the other ways of communicating absent-ness were also confusing (use NaNs, use Inf/-Inf).

We can also add another field like optional dimensions_that_have_zero_non_nan_values (with a less verbose name)?

Copy link
Copy Markdown
Member

@wgtmac wgtmac Apr 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

struct ColumnMetaData {
  5: required i64 num_values;
  12: optional Statistics statistics;
}

struct Statistics {
  3: optional i64 null_count;
}

I still think that ColumnMetaData::num_values == ColumnMetaData::statistics.null_count indicates (implicitly) that the bbox should be empty. We need to fix the implementation of unknown sort order (to generate statistics in the parquet-cpp) instead of complicating the spec.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't understand that the lack of null count was a C++-specific limitation (and I was remembering a previous draft of the geospatial types that had some very strong language that writers were forbidden to write statistics and readers were required to ignore any that were present). I would be ideal if we could also prune 100% empty (but not null), but I think we can make the null count + geospatial_types to prune in most reasonable cases.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would require writing the "empty box" be enough in this case?
Basically:

  • If all values are NULL then the null count logic could kick in.
  • If there are non-NULL values which are all either empty or "invalid" then write the empty bounding box.

The the intention is that if a bounding box is empty the entire group can be skipped.

Co-authored-by: Gang Wu <ustcwg@gmail.com>
@mkaravel
Copy link
Copy Markdown

mkaravel commented Apr 18, 2025

By omit we mean that we should make no assumption about the values from the bbox and thus cannot be used for predicate pushdown.

@jiayuasu @wgtmac @paleolimbot
Is this what we really want? I would expect that if I see a NaN min/max value in the geo statistics that I would be able to safely skip the rows.

@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented Apr 18, 2025

Is this what we really want? I would expect that if I see a min/max value in the geo statistics that I would be able to safely skip the rows.

@mkaravel If writers are instructed to drop the bbox when there is any NaN value, how would you skip the rows if there are some (but not all) NaN values in the bbox? Shouldn't we consider it is malformed?

If there are non-NULL values which are all either empty or "invalid" then write the empty bounding box.

Ah, I got your point. In total there are five cases:

  • (1) All non-null geometry features are valid: produce bbox as usual.
  • (2) Some non-null geometry features are valid but other non-null values are not: produce bbox with only valid data. (Do we need an extra field to indicate there are invalid data?)
  • (3) All non-null geometry features are invalid: we need an empty bbox or yet another field to indicate there are no valid data?
  • (4) All values are null: it can be deduced from ColumnMetaData::num_values == ColumnMetaData::statistics.null_count.
  • (5) No values (a.k.a. an empty row group with 0 rows): it can be deduced from RowGroup.num_rows == 0 or ColumnMetaData::num_values == 0.

It seems that we need a clear approach to indicate there are no valid data (the case 3 above). I'm hesitant to use a bbox of all NaNs to represent an empty bbox because it complicates the logic to deal with cases where some values are NaN while others are not. Proposal from @paleolimbot to introduce dimensions_that_have_zero_non_nan_values might also complicate the case because we need extra checks and might also introduce separate fields for Z and M axises?

Is this a common case and the invalid data should be preserved as is? Or can we simplify this by writing null for invalid data to Parquet so we can eliminate case 2 and 3 above?

@paleolimbot
Copy link
Copy Markdown
Member

Thank you for the summary! I like the place we're at right now: readers can confidently skip files or row groups for all cases I envision being important. The perfectionist in me would love to handle the all-empty-but-not-null case (3) but I would be surprised if this was important (if it is, one can demonstrate a realistic scenario and propose to add something to handle that case). My proposal was hacky and I'd love to avoid it 🙂 ...it's a great point that using a "missing" thift element, an NaN, and/or Inf values for this concept are easily misinterpreted because they carry other meaning in a Parquet context.

I've updated C++'s GeoStatistics to better communicate the meaning of various combinations of empty, uncalculated, or invalid.

@jiayuasu
Copy link
Copy Markdown
Member Author

BTW, I also updated the example file in my PR description to add the LINESTRING EMPTY case. Anyway, I agree with the discussion here and I will update the PR shortly

@mkaravel
Copy link
Copy Markdown

@wgtmac I apologize for the typo in my post (fixed now, I forgot to write the word "NaN"), which has created the confusion.

The summary is great! Thank you!

@mkaravel
Copy link
Copy Markdown

I like the breakdown in the summary. For (3) my suggestion was to go for the empty box.

I agree with @paleolimbot in both aspects: it would be ideal to handle this in a nicer way, but pragmatically I do not think it matters.

I am hesitant modifying user data on-write. Is there a precedent in Parquet where something like that happens? I mean data that is invalid being written as null or omitted?

@paleolimbot
Copy link
Copy Markdown
Member

I am hesitant modifying user data on-write. Is there a precedent in Parquet where something like that happens? I mean data that is invalid being written as null or omitted?

If this is a reference to

Is this a common case and the invalid data should be preserved as is? Or can we simplify this by writing null for invalid data to Parquet so we can eliminate case 2 and 3 above?

Assuming that "invalid" means EMPTY or something with NaNs in it, I view this as something that a higher level wrapper (e.g., Sedona, Spark, or GeoPandas) might expose as an option to allow future readers of the file to have more useful statistics, and not something we would handle in any Parquet writer. Some tools generate EMPTYs, some tools generate nulls, and some computations may accidentally put nan NaN in a value. It's Parquet's job (in my opinion) to transport all of those cases faithfully to a tool that can do some validation and fix things if needed based on user input. (I think that's where we are right now!)

@wgtmac
Copy link
Copy Markdown
Member

wgtmac commented Apr 19, 2025

I agree with @paleolimbot that the Parquet writer does not change users' input and we will still see all the five cases from my summary. To make bbox of cases (2) and (3) more effective, an application that generates parquet files need to transform empty or invalid geometry features to null values so that readers reading these files will only see cases (1) (4) (5).

Adding the concept of empty bbox to the Parquet spec may complicates the reader because it cannot blindly discard bbox with NaN values.

@jiayuasu
Copy link
Copy Markdown
Member Author

@mkaravel @wgtmac @paleolimbot Hey folks, I updated my PR to reflect the latest discussion result. Can you review?

Comment thread Geospatial.md Outdated
Comment thread Geospatial.md Outdated
Comment thread Geospatial.md Outdated
Comment thread Geospatial.md Outdated
jiayuasu and others added 2 commits April 23, 2025 11:21
Comment thread Geospatial.md Outdated
Comment on lines +100 to +101
* X and Y: Skip any invalid X or Y value and continue processing the remaining X or Y
values. Do not produce a bounding box if all X or all Y values are invalid.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* X and Y: Skip any invalid X or Y value and continue processing the remaining X or Y
values. Do not produce a bounding box if all X or all Y values are invalid.
* X and Y: Skip any invalid X or Y value and continue processing the remaining X or Y
values. Do not produce a bounding box if all X and/or all Y values are invalid (even
if there are valid Z and/or M values).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just say NaN instead of invalid?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do if a geometry feature has a NaN value in its X coordinate? Should we ignore all the X/Y/Z/M coordinates from it, or just ignore that specific NaN value? It seems reasonable to do the latter.

Comment thread Geospatial.md Outdated
An invalid geospatial value refers to any of the following cases:

* `null`: A null value in Parquet.
* A non-`null` value that are encoded in a valid WKB or bounding box format
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps it is better to distinguish invalid bbox and invalid WKB value?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally it might be better to inline the relevant pieces above (I've made suggestions...it is mostly just NaN). This section could perhaps be used to make non-spatial implementors aware that EMPTY geometries exist and that NaN values in geometries are rare and usually result in errors when an operation is performed (or it could be omitted).

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Literally null is a valid value from the perspective of Parquet and is ignored by all kinds of min/max stats.

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this, and apologies for taking a few days to circle back here. I don't have any objections here, just some clarifications that may be helpful.

Comment thread Geospatial.md Outdated
Comment on lines +100 to +101
* X and Y: Skip any invalid X or Y value and continue processing the remaining X or Y
values. Do not produce a bounding box if all X or all Y values are invalid.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just say NaN instead of invalid?

Comment thread Geospatial.md Outdated
Omit M from the bounding box if all M values are invalid.

Readers should follow the guidelines below when examining bounding boxes. If
a bounding box is [invalid](#invalid-geospatial-values), it is treated as a `no
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
a bounding box is [invalid](#invalid-geospatial-values), it is treated as a `no
a bounding box is contains `NaN` values, it is treated as a `no

(this would be the only invalid case, correct?

Comment thread Geospatial.md Outdated
Comment on lines +117 to +119
* A bounding box is present:
* X and Y: Both X and Y of the bounding box must be present. If either
is missing, the bounding box is invalid.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* A bounding box is present:
* X and Y: Both X and Y of the bounding box must be present. If either
is missing, the bounding box is invalid.
* A bounding box is present:
* Because xmin, ymin, xmax, and ymax are required fields, they are always present in the bounding box

...maybe?

Comment thread Geospatial.md Outdated
An invalid geospatial value refers to any of the following cases:

* `null`: A null value in Parquet.
* A non-`null` value that are encoded in a valid WKB or bounding box format
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally it might be better to inline the relevant pieces above (I've made suggestions...it is mostly just NaN). This section could perhaps be used to make non-spatial implementors aware that EMPTY geometries exist and that NaN values in geometries are rare and usually result in errors when an operation is performed (or it could be omitted).

Comment thread Geospatial.md Outdated
but are not considered valid under this specification, including:
* `NaN`: Not a Number. For example, `POINT EMPTY` in WKB is represented by a
`Point` with each ordinate value set to an IEEE-754 quiet NaN value.
* `Empty geometries`: Geometries explicitly marked as empty in WKB using
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to detect Empty geometries when collecting the bbox in Parquet? I still don't know what to do if I were to implement this rule.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really a WKB reader-specific behavior. I added the following line:

While different WKB readers may interpret such values differently, the resulting output should be treated as invalid

Comment thread Geospatial.md Outdated
* `Empty geometries`: Geometries explicitly marked as empty in WKB using
indicators such as `numPoints`, `numRings`, or `numGeometries`. Examples
include `LINESTRING EMPTY` or `POLYGON EMPTY`.
* `Out-of-bounds coordinates`: Values that fall outside the valid range
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are writers encouraged to drop X values when Out-of-bounds coordinates has been detected or it is the readers' choice to check this rule? Or both?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the writers are required to drop X values when Out-of-bounds coordinates has been detected.

I clarified the readers behavior below

@jiayuasu
Copy link
Copy Markdown
Member Author

@paleolimbot @wgtmac

I made the following changes:

  1. Remove null from the definition of invalid geospatial values, as it is considered a valid value in Parquet.
  2. Exclude invalid bounding box values from the scope of invalid geospatial values.
  3. Clarify that invalid geospatial values refer to coordinate values encoded in valid WKB, but deemed invalid under this specification.
  4. Remove any guidance that instructs readers to validate bounding boxes. The Parquet specification already defines the requirements for bounding boxes, and it is up to the reader implementation to decide whether or not to enforce validation.

Comment thread Geospatial.md Outdated
values for validation.

* A bounding box is present:
* X and Y: Both X and Y of the bounding box must be present.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* X and Y: Both X and Y of the bounding box must be present.
* X and Y: Both X and Y of the bounding box must be present. If any X or Y
value is invalid, this bounding box is not reliable and cannot be used.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m hesitant to merge this because I want to avoid mentioning invalid values in the bounding box; otherwise, we would need to create a separate section to define them.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one is important for implementations (although I do think you should just say NaN instead of "invalid"). Perhaps a concrete way to phrase this (unless I'm missing something) would be "If any of xmin, ymin, xmax, or ymax is NaN, the bounding box is not reliable and should not be used".

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The guidelines here is for readers to determine whether a bbox is reliable. So I think value is present is not sufficient.

Comment thread Geospatial.md Outdated

* A bounding box is present:
* X and Y: Both X and Y of the bounding box must be present.
* Z: If Z of the bounding box is missing, readers should not assume
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Z: If Z of the bounding box is missing, readers should not assume
* Z: If Z of the bounding box is missing or contains any invalid value, readers should not assume

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "is missing or either zmin or zmax is NaN"?

Comment thread Geospatial.md Outdated
* Z: If Z of the bounding box is missing, readers should not assume
anything about the presence or validity of Z values and may need to
load individual coordinates for validation.
* M: If M of the bounding box is missing, readers should not assume
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* M: If M of the bounding box is missing, readers should not assume
* M: If M of the bounding box is missing or contains any invalid value, readers should not assume

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "is missing or either mmin or mmax is NaN"?

Comment thread Geospatial.md Outdated
# Invalid geospatial values

An invalid geospatial value refers to the coordinate values of a non-`null`
geospatial instance that are encoded in a valid WKB format, but are not
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we have mentioned a valid WKB format, do we need to provide guidelines for invalid WKB format?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it’s very difficult to enumerate all possible invalid WKB formats, and chasing this could easily become another rabbit hole. Since the WKB standard clearly defines what a valid format is, I suggest we leave this out.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Is there any well-known reference for this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we already provided references to the WKB standard?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point that if you're not going to go in to invalid WKB formats, mentioning a valid one is confusing. I think this section is better labeled as "Special geospatial values" since many of the things you mention in it are perfectly valid (and common) cases.

Comment thread Geospatial.md Outdated
Comment thread Geospatial.md Outdated
Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you again for drafting this language!

Comment thread Geospatial.md Outdated
values for validation.

* A bounding box is present:
* X and Y: Both X and Y of the bounding box must be present.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this one is important for implementations (although I do think you should just say NaN instead of "invalid"). Perhaps a concrete way to phrase this (unless I'm missing something) would be "If any of xmin, ymin, xmax, or ymax is NaN, the bounding box is not reliable and should not be used".

Comment thread Geospatial.md Outdated
Comment thread Geospatial.md Outdated
Comment thread Geospatial.md Outdated
Comment thread Geospatial.md Outdated
# Invalid geospatial values

An invalid geospatial value refers to the coordinate values of a non-`null`
geospatial instance that are encoded in a valid WKB format, but are not
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a good point that if you're not going to go in to invalid WKB formats, mentioning a valid one is confusing. I think this section is better labeled as "Special geospatial values" since many of the things you mention in it are perfectly valid (and common) cases.

Comment thread Geospatial.md Outdated
jiayuasu and others added 2 commits April 29, 2025 23:53
Co-authored-by: Dewey Dunnington <dewey@dunnington.ca>
@jiayuasu
Copy link
Copy Markdown
Member Author

Hi @wgtmac @paleolimbot , I updated the PR again. I replaced most occurrences of invalid with special to reduce confusion. Please take a look.

Copy link
Copy Markdown
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for iterating! That last section is clearer to me now. Two minor comments about other things that could be NaN if you're inclined, but looks good to me!

Comment thread Geospatial.md Outdated

* A bounding box is present:
* X and Y: Both X and Y of the bounding box must be present.
* Z: If Z of the bounding box is missing, readers should not assume
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "is missing or either zmin or zmax is NaN"?

Comment thread Geospatial.md Outdated
* Z: If Z of the bounding box is missing, readers should not assume
anything about the presence or validity of Z values and may need to
load individual coordinates for validation.
* M: If M of the bounding box is missing, readers should not assume
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "is missing or either mmin or mmax is NaN"?

Comment thread Geospatial.md Outdated
jiayuasu and others added 2 commits April 30, 2025 08:41
@jiayuasu
Copy link
Copy Markdown
Member Author

All fixed. Please review again. Thank you! @wgtmac @paleolimbot

Comment thread Geospatial.md Outdated
* `null` instance: Skip it and continue processing the remaining
geospatial instances. Do not produce a bounding box if all instances are null.
* Non-`null` instance with [special geospatial values](#special-geospatial-values):
* X and Y: Skip any special X or Y value and continue processing the
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a question here, do we mean: skip any geospatial instance with special X and Y value

(ie, if a linestring has a thousand coordinates, but only one has bad X,Y value, its enough to skip whole instance?)

Copy link
Copy Markdown
Member Author

@jiayuasu jiayuasu Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, what we mean is: if a linestring has a thousand coordinates and only 1 coordinate has a bad Y value, we skip this single Y value, continue aggregating on other Y values in this linestring instance

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. sorry i have another confusion in the doc, I know we define special geospatial value for the geospatial instance. But then in the language we say skip special X or Y value, we dont define what is a special coordinate

Copy link
Copy Markdown
Member Author

@jiayuasu jiayuasu Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a LineString object as an example LINESTRING (0 NaN, 0 1, 1 2):

We have 3 levels here:

  1. A geospatial instance: this linestring object
  2. A geospatial coordinate:
    • we have 3 geospatial coordinates 0 NaN , 0 1, 1 2
  3. A geospatial value:
    • For X, we have 0, 0, 1
    • For Y, we have NaN, 1, 2

The special geospatial value in this spec refers to the 3rd level. When calculating bbox, we only skipped that single bad value NaN, and use X (0, 0, 1) and Y (1, 2).

That's why we didn't have a definition for special coordinate (2nd level) because we look at the 3rd level directly

Copy link
Copy Markdown
Member

@szehon-ho szehon-ho Apr 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK sorry i equated levels 2 and 3 in my head (coordinate and value).

But currently the language of 'special geospatial value' from the link mentions things like 'Nan (point)' , 'empty geometry', and 'invalid WKB' which seems to talk about invalid instances (level 1), which is my confusion. Then the section talks not about those, but about if actual x,v,z,m values (level 3) are invalid. Individual value is invalid only if NaN and outside the range (depending on context), right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@szehon-ho I think you’re right. The current wording conflated axis values and coordinates. I’ve updated the special geospatial value section to clarify the distinction.

Comment thread Geospatial.md Outdated
remaining X or Y values. Do not produce a bounding box if all X or all Y
values are special values.

* Z: Skip any special Z value and continue processing the remaining Z values.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we just omit 'continue processing the remaining...' in all these sections, to reduce the verbosity , as its a bit obvious?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you don't have a strong opinion here, I kind of want to keep it verbose. We tend to make it clear that one should only omit individual X/Y/Z/M value and continue using the remaining values, instead of dropping the coordinate or whole geospatial instance.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i think its fine with me. For me It'd probably be more helpful if somehow we had that context (talking about continuing with rest of coordinates in an instance), than it is now. Thinking if we can put this somewhere, but agree not in this line though.

Comment thread Geospatial.md Outdated
Comment thread Geospatial.md Outdated
@jiayuasu
Copy link
Copy Markdown
Member Author

jiayuasu commented May 5, 2025

@wgtmac Please review. Thank you!

Comment thread Geospatial.md Outdated
@jiayuasu
Copy link
Copy Markdown
Member Author

@wgtmac @rdblue The Iceberg clarification PR has been merged. I updated the Parquet PR accordingly. Please review and approve.

Copy link
Copy Markdown
Member

@szehon-ho szehon-ho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks @jiayuasu !

Comment thread Geospatial.md Outdated
@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 12, 2025

Looks great. Thanks for the update, @jiayuasu!

@rdblue rdblue merged commit 6c8ee40 into apache:master May 12, 2025
4 checks passed
@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 12, 2025

Thanks, everyone! I merged this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants